Non-Personalized Recommenders Assignment

Overview

This assignment will explore non-personalized recommendations. You will be given a 20x20 matrix where columns represent movies, rows represent users, and each cell represents a user-movie rating.

Deliverables

There are 4 deliverables for this assignment. Each deliverable represents a different analysis of the data provided to you. For each deliverable, you will submit a list of the top 5 movies as ranked by a particular metric. The 4 metrics are:

  1. Mean Rating: Calculate the mean rating for each movie, order with the highest rating listed first, and submit the top 5.
  2. % of ratings 4+: Calculate the percentage of ratings for each movie that are 4 or higher. Order with the highest percentage first, and submit the top 5.
  3. Rating Count: Count the number of ratings for each movie, order with the largest number of ratings first, and submit the top 5.
  4. Top 5 Star Wars: Find the movies that most often occur with Star Wars: Episode IV - A New Hope (1977) using the (x and y)/x method described in class, where x is Star Wars and y is the other movie. In other words, for each movie, calculate the percentage of Star Wars raters who also rated that movie. Order with the highest percentage first, and submit the top 5.
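
The four metrics above can be written compactly. The notation is an assumption introduced here for illustration and does not come from the assignment: U_m is the set of users who rated movie m, r_{u,m} is user u's rating of movie m, and x denotes Star Wars: Episode IV.

```latex
\begin{align*}
\text{mean}(m) &= \frac{1}{|U_m|} \sum_{u \in U_m} r_{u,m} \\
\text{pct}_{4+}(m) &= \frac{|\{u \in U_m : r_{u,m} \ge 4\}|}{|U_m|} \times 100 \\
\text{count}(m) &= |U_m| \\
\text{assoc}(x, m) &= \frac{|U_x \cap U_m|}{|U_x|} \times 100
\end{align*}
```

In every case the denominator counts only users who actually rated the movie, so missing ratings never enter the calculation.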

Importing Libraries


In [114]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Loading the Data


In [115]:
# Loading the data into a Pandas dataframe
movie_data = pd.read_csv('A1Ratings.csv')

In [116]:
# Looking at the first 5 rows of the dataframe
movie_data.head()


Out[116]:
User 260: Star Wars: Episode IV - A New Hope (1977) 1210: Star Wars: Episode VI - Return of the Jedi (1983) 356: Forrest Gump (1994) 318: Shawshank Redemption, The (1994) 593: Silence of the Lambs, The (1991) 3578: Gladiator (2000) 1: Toy Story (1995) 2028: Saving Private Ryan (1998) 296: Pulp Fiction (1994) ... 2396: Shakespeare in Love (1998) 2916: Total Recall (1990) 780: Independence Day (ID4) (1996) 541: Blade Runner (1982) 1265: Groundhog Day (1993) 2571: Matrix, The (1999) 527: Schindler's List (1993) 2762: Sixth Sense, The (1999) 1198: Raiders of the Lost Ark (1981) 34: Babe (1995)
0 755 1 5 2 NaN 4 4 2 2 NaN ... 2 NaN 5 2 NaN 4 2 5 NaN NaN
1 5277 5 3 NaN 2 4 2 1 NaN NaN ... 3 2 2 NaN 2 NaN 5 1 3 NaN
2 1577 NaN NaN NaN 5 2 NaN 4 NaN NaN ... NaN 1 4 4 1 1 2 3 1 3
3 4388 NaN 3 NaN NaN NaN 1 2 3 4 ... NaN 4 1 3 5 NaN 5 1 1 2
4 1202 4 3 4 1 4 1 NaN 4 NaN ... 5 1 NaN 4 NaN 3 5 5 NaN NaN

5 rows × 21 columns


In [117]:
# Printing the column names of the dataframe
movie_data.columns


Out[117]:
Index([u'User', u'260: Star Wars: Episode IV - A New Hope (1977)',
       u'1210: Star Wars: Episode VI - Return of the Jedi (1983)',
       u'356: Forrest Gump (1994)', u'318: Shawshank Redemption, The (1994)',
       u'593: Silence of the Lambs, The (1991)', u'3578: Gladiator (2000)',
       u'1: Toy Story (1995)', u'2028: Saving Private Ryan (1998)',
       u'296: Pulp Fiction (1994)', u'1259: Stand by Me (1986)',
       u'2396: Shakespeare in Love (1998)', u'2916: Total Recall (1990)',
       u'780: Independence Day (ID4) (1996)', u'541: Blade Runner (1982)',
       u'1265: Groundhog Day (1993)', u'2571: Matrix, The (1999)',
       u"527: Schindler's List (1993)", u'2762: Sixth Sense, The (1999)',
       u'1198: Raiders of the Lost Ark (1981)', u'34: Babe (1995)'],
      dtype='object')

In [118]:
# Summarizing the data in the movie_data dataframe
movie_data.describe()


Out[118]:
User 260: Star Wars: Episode IV - A New Hope (1977) 1210: Star Wars: Episode VI - Return of the Jedi (1983) 356: Forrest Gump (1994) 318: Shawshank Redemption, The (1994) 593: Silence of the Lambs, The (1991) 3578: Gladiator (2000) 1: Toy Story (1995) 2028: Saving Private Ryan (1998) 296: Pulp Fiction (1994) ... 2396: Shakespeare in Love (1998) 2916: Total Recall (1990) 780: Independence Day (ID4) (1996) 541: Blade Runner (1982) 1265: Groundhog Day (1993) 2571: Matrix, The (1999) 527: Schindler's List (1993) 2762: Sixth Sense, The (1999) 1198: Raiders of the Lost Ark (1981) 34: Babe (1995)
count 20.000000 15.000000 14.000000 10.000000 10.000000 16.00000 12.000000 17.000000 11.000000 11.000000 ... 11.000000 12.000000 13.000000 9.000000 12.000000 12.000000 12.000000 12.000000 11.000000 10.000000
mean 3658.100000 3.266667 3.000000 2.700000 3.600000 3.06250 2.916667 2.823529 3.000000 3.000000 ... 2.909091 1.916667 2.769231 3.222222 3.166667 2.833333 3.000000 2.833333 2.909091 3.000000
std 1749.716756 1.387015 1.467599 1.337494 1.646545 1.28938 1.564279 1.131111 1.414214 1.183216 ... 1.513575 0.996205 1.235168 1.092906 1.585923 1.527525 1.595448 1.642245 1.578261 1.414214
min 139.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 2558.750000 2.000000 2.000000 2.000000 2.500000 2.00000 1.750000 2.000000 2.000000 2.000000 ... 2.000000 1.000000 2.000000 2.000000 2.000000 1.750000 2.000000 1.000000 1.500000 2.000000
50% 4252.500000 4.000000 3.000000 2.500000 4.000000 3.00000 3.000000 2.000000 3.000000 3.000000 ... 3.000000 2.000000 3.000000 3.000000 3.000000 2.500000 2.500000 3.000000 3.000000 2.500000
75% 4916.250000 4.000000 4.000000 3.750000 5.000000 4.00000 4.000000 4.000000 4.000000 4.000000 ... 4.000000 2.250000 4.000000 4.000000 5.000000 4.000000 5.000000 4.250000 4.000000 4.000000
max 6037.000000 5.000000 5.000000 5.000000 5.000000 5.00000 5.000000 5.000000 5.000000 5.000000 ... 5.000000 4.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

8 rows × 21 columns

Non-Personalized Recommenders for Raiders of the Lost Ark


In [119]:
# Storing the "1198: Raiders of the Lost Ark (1981)" column as a Pandas Series
raid_lost_arc = movie_data["1198: Raiders of the Lost Ark (1981)"]
raid_lost_arc


Out[119]:
0    NaN
1      3
2      1
3      1
4    NaN
5    NaN
6      5
7      5
8    NaN
9    NaN
10     1
11   NaN
12     5
13   NaN
14     3
15     3
16   NaN
17     2
18   NaN
19     3
Name: 1198: Raiders of the Lost Ark (1981), dtype: float64

Mean rating for Raiders of the Lost Ark (1981)


In [120]:
print('%.2f' % ( raid_lost_arc.mean() ))


2.91

Number of non-NA ratings for Raiders of the Lost Ark (1981)


In [121]:
raid_lost_arc.count()


Out[121]:
11

Percentage of ratings >=4 for Raiders of the Lost Ark (1981)


In [122]:
print('%.1f' % ( (len(raid_lost_arc[raid_lost_arc>=4])/float(raid_lost_arc.count()))*100.0 ))


27.3
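
The three single-movie numbers above can be reproduced without the CSV file. The sketch below rebuilds the Raiders of the Lost Ark column from the values printed in Out[119]; the variable names are chosen here for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Raiders of the Lost Ark ratings, copied from the Out[119] listing
raiders = pd.Series([np.nan, 3, 1, 1, np.nan, np.nan, 5, 5, np.nan, np.nan,
                     1, np.nan, 5, np.nan, 3, 3, np.nan, 2, np.nan, 3])

n_rated = raiders.count()        # count() skips NaN
n_high = (raiders >= 4).sum()    # NaN >= 4 evaluates to False
mean_rating = raiders.mean()     # mean() skips NaN by default
pct_high = 100.0 * n_high / float(n_rated)

print('%d ratings, mean %.2f, %.1f%% rated 4+' % (n_rated, mean_rating, pct_high))
```

This should agree with the cell outputs above: 11 ratings, a mean of 2.91, and 27.3% of ratings at 4 or higher.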

Finding Association of Raiders of the Lost Ark (1981) with Star Wars Episode IV. The association with Star Wars Episode IV is defined as the number of users that rated BOTH Raiders of the Lost Ark (1981) and Star Wars Episode IV divided by the number of users that rated Star Wars Episode IV.


In [123]:
# First, storing the Star Wars count
star_wars_count = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count()

In [124]:
# Multiply the Raiders of the Lost Ark and Star Wars columns element-wise.
# The product is non-NA only where both ratings are present, so counting
# its non-NA entries gives the number of users who rated both movies.
rad_arc_star_wars_count = (movie_data["1198: Raiders of the Lost Ark (1981)"]*movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]).count()

Printing the Association of Raiders of the Lost Ark (1981) and Star Wars Episode IV


In [125]:
print('%.1f' % ( (rad_arc_star_wars_count/float(star_wars_count))*100.0 ))


46.7
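
The product trick works because any arithmetic with NaN yields NaN. A possibly more readable way to count co-ratings is to use boolean masks. The sketch below uses a small hypothetical frame (the columns 'x' and 'y' and their values are invented, not taken from A1Ratings.csv) and checks that both approaches give the same count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings for five users (not taken from A1Ratings.csv)
toy = pd.DataFrame({
    'x': [4, np.nan, 5, 3, np.nan],
    'y': [np.nan, 2, 1, 5, np.nan],
})

rated_x = toy['x'].notnull()         # True where the user rated x
rated_y = toy['y'].notnull()         # True where the user rated y

both = (rated_x & rated_y).sum()     # users who rated x and y
assoc = both / float(rated_x.sum())  # share of x raters who also rated y

# The product trick from the notebook gives the same co-rating count
assert both == (toy['x'] * toy['y']).count()
print('%.1f%%' % (100.0 * assoc))
```

Here two of the three users who rated x also rated y, so the association is about 66.7%.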

Finding top 5 movies with the highest ratings

Making a Pandas Series with the index name equal to the movie and the entry equal to the mean rating for each movie. The column names are sliced from [1:] because the first column of the movie_data dataframe is the user id.


In [126]:
rating_means = pd.Series([movie_data[col_name].mean() for col_name in movie_data.columns[1:]], 
                         index=movie_data.columns[1:])

Printing the top 5 rated movies


In [127]:
rating_means.sort_values(ascending=False)[0:5]


Out[127]:
318: Shawshank Redemption, The (1994)             3.600000
260: Star Wars: Episode IV - A New Hope (1977)    3.266667
541: Blade Runner (1982)                          3.222222
1265: Groundhog Day (1993)                        3.166667
593: Silence of the Lambs, The (1991)             3.062500
dtype: float64
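
The list comprehension above computes one mean per column. As an alternative sketch, DataFrame.mean does the same thing column-wise in a single call, and Series.nlargest replaces the sort-then-slice step. The toy frame below is hypothetical and only mimics the layout of movie_data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same layout as movie_data:
# a 'User' column followed by one column per movie
toy = pd.DataFrame({
    'User': [1, 2, 3, 4],
    'Movie A': [5, 4, np.nan, 3],
    'Movie B': [2, np.nan, np.nan, 4],
    'Movie C': [np.nan, 5, 5, 4],
})

ratings = toy.drop('User', axis=1)   # keep only the rating columns
mean_by_movie = ratings.mean()       # column-wise mean, NaN skipped
top2 = mean_by_movie.nlargest(2)     # same idea as sort_values(...)[0:2]
print(top2)
```

Dropping the user id by name rather than by position keeps the code correct if the column order ever changes.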

Finding top 5 movies with the most ratings

Making a Pandas Series with the index name equal to the movie and the entry equal to the number of non-NA ratings for each movie. The column names are sliced from [1:] because the first column of the movie_data dataframe is the user id.


In [128]:
rating_count = pd.Series([movie_data[col_name].count() for col_name in movie_data.columns[1:]], 
                         index=movie_data.columns[1:])

Printing the top 5 movies with the most ratings


In [129]:
rating_count.sort_values(ascending=False)[0:5]


Out[129]:
1: Toy Story (1995)                                        17
593: Silence of the Lambs, The (1991)                      16
260: Star Wars: Episode IV - A New Hope (1977)             15
1210: Star Wars: Episode VI - Return of the Jedi (1983)    14
780: Independence Day (ID4) (1996)                         13
dtype: int64

Top 5 movies with Percentage of ratings >=4

Making a Pandas Series with the index name equal to the movie and the entry equal to the fraction of non-NA ratings that are 4 or higher for each movie (multiply by 100 for the percentage). The column names are sliced from [1:] because the first column of the movie_data dataframe is the user id.


In [130]:
rating_positive = pd.Series([sum(movie_data[col_name]>=4)/float(movie_data[col_name].count()) for col_name in movie_data.columns[1:]], 
                             index=movie_data.columns[1:])

Printing Top 5 movies with Percentage of ratings >=4


In [131]:
rating_positive.sort_values(ascending=False)[0:5]


Out[131]:
318: Shawshank Redemption, The (1994)             0.700000
260: Star Wars: Episode IV - A New Hope (1977)    0.533333
3578: Gladiator (2000)                            0.500000
541: Blade Runner (1982)                          0.444444
593: Silence of the Lambs, The (1991)             0.437500
dtype: float64
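
One detail worth checking in this metric is the denominator: it must be the number of non-NA ratings, not the number of rows. The sketch below, on hypothetical toy data, contrasts the two. Taking the mean of the boolean comparison looks tempting but divides by every row, including users who never rated the movie.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; NaN marks a missing rating
ratings = pd.DataFrame({
    'Movie A': [5, 4, np.nan, 3],
    'Movie B': [2, np.nan, np.nan, 4],
})

high = (ratings >= 4).sum()      # ratings of 4 or 5 per movie
share = high / ratings.count()   # divide by non-missing ratings only
naive = (ratings >= 4).mean()    # divides by all rows, NaN included

print((100 * share).round(1))
print((100 * naive).round(1))
```

For Movie A the correct share is 2 of 3 ratings, while the naive version reports 2 of 4 rows.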

Top 5 movies most similar to Star Wars (movie id = 260)


In [132]:
# First, storing the Star Wars ratings and the count of non-NA Star Wars ratings
star_wars_rat = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]
star_wars_count = float(star_wars_rat.count())
print(star_wars_count)


15.0

Finding Association of all movies with Star Wars Episode IV. The association with Star Wars Episode IV is defined as the number of users that rated BOTH movie i and Star Wars Episode IV divided by the number of users that rated Star Wars Episode IV. Below, we are looping over [2:] to not include Star Wars Episode IV in the Association calculation.


In [133]:
sim_val = pd.Series( [ (movie_data[col_name]*star_wars_rat).count()/star_wars_count 
                     for col_name in movie_data.columns[2:] ], index=movie_data.columns[2:] )

Printing Top 5 movies most similar to Star Wars (movie id = 260)


In [134]:
sim_val.sort_values(ascending=False)[0:5]


Out[134]:
1: Toy Story (1995)                                        0.933333
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.866667
593: Silence of the Lambs, The (1991)                      0.800000
780: Independence Day (ID4) (1996)                         0.733333
2916: Total Recall (1990)                                  0.666667
dtype: float64
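
The association for every movie can also be computed at once from a boolean "has rated" matrix, which avoids multiplying rating values. This is a sketch on hypothetical toy data, with 'X' standing in for Star Wars; none of the names below come from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; 'X' stands in for the reference movie
ratings = pd.DataFrame({
    'X': [4, 5, np.nan, 3, 2],
    'A': [1, np.nan, 5, 2, 4],
    'B': [np.nan, 3, 3, np.nan, np.nan],
})

rated = ratings.notnull()                       # boolean "has rated" matrix
x_raters = rated[rated['X']]                    # rows for users who rated X
assoc = x_raters.sum() / float(len(x_raters))   # share of X raters per movie
assoc = assoc.drop('X')                         # X trivially co-occurs with itself
print(assoc.sort_values(ascending=False))
```

Four users rated X; three of them rated A and one rated B, giving associations of 0.75 and 0.25. Dropping 'X' by name plays the same role as slicing the columns from [2:] in the notebook.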
